right lower abdominal pain
Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
Kim, Hyunjae, Sohn, Jiwoong, Gilson, Aidan, Cochran-Caggiano, Nicholas, Applebaum, Serina, Jin, Heeju, Park, Seihee, Park, Yujin, Park, Jiyeong, Choi, Seoyoung, Contreras, Brittany Alexandra Herrera, Huang, Thomas, Yun, Jaehoon, Wei, Ethan F., Jiang, Roy, Colucci, Leah, Lai, Eric, Dave, Amisha, Guo, Tuo, Singer, Maxwell B., Koo, Yonghoe, Adelman, Ron A., Zou, James, Taylor, Andrew, Cohan, Arman, Xu, Hua, Chen, Qingyu
Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
- Europe > Austria > Vienna (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > South Korea > Seoul > Seoul (0.04)
- (14 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues
Chen, Junhao, Huang, Yu, Li, Siyuan, Yao, Rui, Li, Hanqian, Zhang, Hanyu, Li, Jungang, Chen, Jian, Wang, Bowen, Hu, Xuming
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. However, existing benchmarks are limited to single-turn dialogue, while multi-turn dialogue benchmarks typically assess other orthogonal capabilities rather than knowledge-intensive factuality. To bridge this critical gap, we introduce \textbf{KnowMT-Bench}, the \textit{first-ever} benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields, including medicine, finance, and law. To faithfully assess the model's real-world performance, KnowMT-Bench employs a dynamic evaluation setting where models generate their own multi-turn dialogue histories given logically progressive question sequences. The factual capability and information delivery efficiency of the \textit{final-turn} answer are then evaluated using a human-validated automated pipeline. Our experiments reveal that multi-turn contexts degrade performance: factual capability declines due to the contextual noise from self-generated histories, while information efficiency drops as models become more verbose with increasing dialogue length. We then investigate mitigation strategies, demonstrating that retrieval-augmented generation (RAG) can effectively alleviate and even reverse this factual degradation. These findings underscore the importance of our benchmark in evaluating and enhancing the conversational factual capabilities of LLMs in real-world knowledge-intensive applications. Code is available at \href{https://github.com/hardenyu21/KnowMT-Bench}{\textcolor{cyan}{\texttt{KnowMT-Bench}}}.
- North America > United States (1.00)
- Europe > United Kingdom > Northern Ireland (0.04)
- Asia > China > Hong Kong (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.92)
- Law (1.00)
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (1.00)
- (11 more...)
K-QA: A Real-World Medical Q&A Benchmark
Manes, Itay, Ronn, Naama, Cohen, David, Ber, Ran Ilan, Horowitz-Kugler, Zehavi, Stanovsky, Gabriel
Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information in the generated answer and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to to the community to spur research into medically accurate NLP applications.
- South America > Brazil (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Indiana > Hamilton County > Carmel (0.04)
- (3 more...)